Abstract:
As more and more data is provided in RDF format, storing huge amounts of RDF data and efficiently processing queries on such data are becoming increasingly important. The first part of the lecture will introduce state-of-the-art techniques for scalably storing and querying RDF with relational systems, including alternatives for storing RDF, efficient index structures, and query optimization techniques. As centralized RDF repositories have limitations in scalability and failure tolerance, decentralized architectures have been proposed. The second part of the lecture will highlight system architectures and strategies for distributed RDF processing. We cover search engines as well as federated query processing, highlight differences from classic federated database systems, and discuss efficient techniques for distributed query processing in general and for RDF data in particular. Moreover, for the last part of this chapter, we argue that extracting knowledge from the Web is an excellent showcase (and potentially one of the biggest challenges) for the scalable management of uncertain data we have seen so far. The third part of the lecture is thus intended to provide a close-up on current approaches and platforms to make reasoning (e.g., in the form of probabilistic inference) with uncertain RDF data scalable to billions of triples.
Abstract:
One of the main goals of recent developments in the Smart Grid area is to increase the use of renewable energy sources. These sources are characterized by energy fluctuations that might lead to energy imbalances and congestion in the electricity grid. Exploiting inherent flexibilities, which exist in both energy production and consumption, is the key to solving these problems. Flexibilities can be expressed as flex-offers, which, due to their high number, need to be aggregated to reduce the complexity of energy scheduling. In this paper, we discuss balance aggregation techniques that aim, already during aggregation, at balancing flexibilities in production and consumption, in order to reduce both the probability of congestion and the complexity of scheduling. We present results of our extensive experiments.
Abstract:
The Semantic Web (SW) has drawn the attention of data enthusiasts and has also inspired the exploitation and design of multidimensional data warehouses in an unconventional way. Traditional data warehouses (DW) operate over static data. However, the multidimensional (MD) data modeling approach can be dynamically extended by defining both the schema and instances of MD data as RDF graphs. The importance and applicability of MD data warehouses over RDF have been widely studied, yet none of the existing works supports a spatially enhanced MD model on the SW. Spatial support in DWs is a desirable feature for enhanced analysis, since adding encoded spatial information to the data allows querying with spatial functions. In this paper, we propose to empower the spatial dimension of data warehouses by adding spatial data types and topological relationships to the existing QB4OLAP vocabulary, which already supports the representation of the constructs of MD models in RDF. With QB4SOLAP, spatial constructs of MD models can also be published in RDF, which allows implementing spatial and metric analysis on spatial members along with OLAP operations. In our contribution, we describe a set of spatial OLAP (SOLAP) operations, demonstrate a spatially extended metamodel, QB4SOLAP, and apply it to a use case scenario. Finally, we show how these SOLAP queries can be expressed in SPARQL.
Abstract:
The steadily growing popularity of semantic data on the Web and the support for aggregation queries in SPARQL 1.1 have propelled interest in Online Analytical Processing (OLAP) and data cubes in RDF. Query processing in such settings is challenging because SPARQL OLAP queries usually contain many triple patterns with grouping and aggregation. Moreover, one important factor of query answering on Web data is its provenance, i.e., metadata about its origin. Some applications in data analytics and access control require augmenting the data with provenance metadata and running queries that impose constraints on this provenance. This task is called provenance-aware query answering. In this paper, we investigate the benefit of caching some parts of an RDF cube augmented with provenance information when answering provenance-aware SPARQL queries. We propose provenance-aware caching (PAC), a caching approach based on a provenance-aware partitioning of RDF graphs, and a benefit model for RDF cubes and SPARQL queries with aggregation. Our results on real and synthetic data show that PAC significantly outperforms the least recently used (LRU) strategy and the Jena TDB native caching in terms of hit rate and response time.
Abstract:
Answering queries over a federation of SPARQL endpoints requires combining data from more than one data source. Optimizing queries in such scenarios is particularly challenging not only because of (i) the large variety of possible query execution plans that correctly answer the query but also because (ii) there is only limited access to statistics about schema and instance data of remote sources. To overcome these challenges, most federated query engines rely on heuristics to reduce the space of possible query execution plans or on dynamic programming strategies to produce optimal plans. Nevertheless, these plans may still exhibit a high number of intermediate results or high execution times because of heuristics and inaccurate cost estimations. In this paper, we present Odyssey, an approach that uses statistics that allow for a more accurate cost estimation for federated queries and therefore enables Odyssey to produce better query execution plans. Our experimental results show that Odyssey produces query execution plans that are better in terms of data transfer and execution time than state-of-the-art optimizers. Our experiments using the FedBench benchmark show execution time gains of at least 25 times on average.